AITopics | quantization strategy

quantization strategy

MOSS: Efficient and Accurate FP8 LLM Training with Microscaling and Automatic Scaling

Zhang, Yu, Zhen, Hui-Ling, Yuan, Mingxuan, Yu, Bei

arXiv.org Artificial Intelligence, Dec-8-2025

Training large language models with FP8 formats offers significant efficiency gains. However, the reduced numerical precision of FP8 poses challenges for stable and accurate training. Current frameworks preserve training performance using mixed-granularity quantization, i.e., applying per-group quantization for activations and per-tensor/block quantization for weights. While effective, per-group quantization requires scaling along the inner dimension of matrix multiplication, introducing additional dequantization overhead. Moreover, these frameworks often rely on just-in-time scaling to dynamically adjust scaling factors based on the current data distribution. This online quantization is inefficient for FP8 training, as it involves multiple memory reads and writes that negate the performance benefits of FP8. To overcome these limitations, we propose MOSS, a novel FP8 training framework that ensures both efficiency and numerical stability. MOSS introduces two key innovations: (1) a two-level microscaling strategy for quantizing sensitive activations, which balances precision and dequantization cost by combining a high-precision global scale with compact, power-of-two local scales; and (2) automatic scaling for weights in linear layers, which eliminates the need for costly max-reduction operations by predicting and adjusting scaling factors during training. Leveraging these techniques, MOSS enables efficient FP8 training of a 7B-parameter model, achieving performance comparable to the BF16 baseline while delivering up to 34% higher training throughput. Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, including reasoning, language understanding, and generation (Achiam et al., 2023; Grattafiori et al., 2024; Liu et al., 2024; Adler et al., 2024).
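The two-level microscaling idea in the abstract above (one high-precision global scale combined with power-of-two local scales) can be illustrated with a small sketch. This is not the MOSS implementation: the group size, the choice of local exponent, and the simplified E4M3-style rounding in `fake_fp8` are all assumptions made for illustration.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite magnitude in the E4M3 format


def fake_fp8(v):
    """Round v to an E4M3-like grid (3 mantissa bits). A simulation, not a bit-exact cast."""
    if v == 0.0:
        return 0.0
    v = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v))
    exp = max(math.floor(math.log2(abs(v))), -6)  # -6: smallest normal exponent
    step = 2.0 ** (exp - 3)                       # spacing of representable values
    return round(v / step) * step


def quantize_two_level(x, group_size):
    """Return (codes, per-group exponents, global scale) for a flat list x."""
    global_amax = max(abs(v) for v in x)
    global_scale = global_amax / FP8_E4M3_MAX if global_amax > 0 else 1.0
    codes, exponents = [], []
    for start in range(0, len(x), group_size):
        group = x[start:start + group_size]
        group_amax = max(abs(v) for v in group)
        if global_amax > 0 and group_amax > 0:
            ratio = group_amax / global_amax      # in (0, 1]
        else:
            ratio = 1.0
        # Assumed rule: smallest power of two that still covers the group.
        exponent = math.ceil(math.log2(ratio))    # always <= 0
        local_scale = 2.0 ** exponent
        exponents.append(exponent)
        codes.extend(fake_fp8(v / (global_scale * local_scale)) for v in group)
    return codes, exponents, global_scale


def dequantize_two_level(codes, exponents, global_scale, group_size):
    """Invert quantize_two_level: one float multiply plus an exponent shift per group."""
    return [c * global_scale * 2.0 ** exponents[i // group_size]
            for i, c in enumerate(codes)]
```

Because each local scale is a power of two, applying it only shifts the exponent, so the single floating-point multiply by the global scale can be deferred to the end of the accumulation. This is the cost argument the abstract alludes to, shown here only in scalar form.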

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2511.05811

Genre: Research Report > New Finding (0.46)

Industry: Energy (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Layer-Wise High-Impact Parameter Ratio Optimization in Post-Training Quantization for Large Language Models

Pham, Cuong, Dung, Hoang Anh, Nguyen, Cuong C., Le, Trung, Carneiro, Gustavo, Do, Thanh-Toan

arXiv.org Artificial Intelligence, Nov-25-2025

Large language models (LLMs) have significantly advanced natural language processing, but their massive parameter counts create substantial computational and memory challenges during deployment. Post-training quantization (PTQ) has emerged as a promising approach to mitigate these challenges with minimal overhead. While existing PTQ methods can effectively quantize LLMs, they experience substantial accuracy loss at extremely low bit-widths, primarily due to high-impact parameters that significantly influence quantization performance. Several approaches address these issues by identifying and retaining the high-impact parameters in FP16 format. However, they apply fixed ratios of high-impact parameters across all layers, overlooking layer-wise sensitivity variations. In this paper, we propose a quadratic optimization framework that determines layer-specific ratios of high-impact parameters while considering inter-layer dependencies. We quantize high-impact parameters to moderate bit-widths, which often results in negligible performance degradation in quantized LLMs, while the remaining parameters can be quantized to extremely low bit-widths. Under the same resource-constrained budget, this allows for preserving more high-impact parameters than methods that keep only a few in FP16 format. Additionally, the proposed framework allows us to leverage an advanced quantization method that often requires extensive learnable parameters solely for high-impact parameters, while applying a computationally efficient method to the rest. Our approach achieves an effective balance between computational efficiency and model accuracy while maintaining high performance compared to state-of-the-art methods.
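The budget argument in the abstract above can be made concrete with a little arithmetic. The sketch below is illustrative only: the bit-widths (4 and 2), the use of weight magnitude as a stand-in for "impact", and all function names are assumptions, and the paper's quadratic optimization over layer-specific ratios is not reproduced.

```python
def average_bits(ratio, hi_bits, lo_bits):
    """Average bits per weight when a fraction `ratio` uses hi_bits and the rest lo_bits."""
    return ratio * hi_bits + (1.0 - ratio) * lo_bits


def equivalent_ratio(fp16_ratio, hi_bits, lo_bits, fp_bits=16):
    """Fraction of weights that can be kept at hi_bits for the same average-bit
    budget as keeping `fp16_ratio` of them at fp_bits."""
    return fp16_ratio * (fp_bits - lo_bits) / (hi_bits - lo_bits)


def quantize_uniform(values, bits):
    """Symmetric uniform quantize-dequantize with one scale per group (bits >= 2)."""
    levels = 2 ** (bits - 1) - 1
    amax = max((abs(v) for v in values), default=0.0)
    if amax == 0.0:
        return [0.0 for _ in values]
    scale = amax / levels
    return [round(v / scale) * scale for v in values]


def quantize_layer(weights, ratio, hi_bits=4, lo_bits=2):
    """Quantize the top `ratio` of weights (by magnitude, an assumed impact proxy)
    at hi_bits and the remainder at lo_bits."""
    k = round(ratio * len(weights))
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    out = [0.0] * len(weights)
    for idx, bits in ((order[:k], hi_bits), (order[k:], lo_bits)):
        group = quantize_uniform([weights[i] for i in idx], bits)
        for i, q in zip(idx, group):
            out[i] = q
    return out
```

Under these assumed bit-widths, `equivalent_ratio(0.01, 4, 2)` shows that a budget spent on keeping 1% of weights in FP16 would cover about 7% of weights at 4 bits, which is the sense in which more high-impact parameters can be preserved.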

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2511.17801

Genre: Research Report > Promising Solution (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.31)

InfiR2: A Comprehensive FP8 Training Recipe for Reasoning-Enhanced Language Models

Wang, Wenjun, Cai, Shuo, Xie, Congkai, Feng, Mingfa, Zhang, Yiming, Li, Zhen, Yang, Kejing, Li, Ming, Cao, Jiannong, Yang, Hongxia

arXiv.org Artificial Intelligence, Oct-20-2025

The immense computational cost of training Large Language Models (LLMs) presents a major barrier to innovation. While FP8 training offers a promising solution with significant theoretical efficiency gains, its widespread adoption has been hindered by the lack of a comprehensive, open-source training recipe. To bridge this gap, we introduce an end-to-end FP8 training recipe that seamlessly integrates continual pre-training and supervised fine-tuning. Our methodology employs a fine-grained, hybrid-granularity quantization strategy to maintain numerical fidelity while maximizing computational efficiency. Through extensive experiments, including the continual pre-training of models on a 160B-token corpus, we demonstrate that our recipe is not only remarkably stable but also essentially lossless, achieving performance on par with the BF16 baseline across a suite of reasoning benchmarks. Crucially, this is achieved with substantial efficiency improvements, including up to a 22% reduction in training time, a 14% decrease in peak memory usage, and a 19% increase in throughput. Our results establish FP8 as a practical and robust alternative to BF16, and we will release the accompanying code to further democratize large-scale model training.

large language model, machine learning, quantization, (16 more...)

arXiv.org Artificial Intelligence

2509.22536

Genre: Research Report > New Finding (0.88)

Industry: Education (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

CHORD: Customizing Hybrid-precision On-device Model for Sequential Recommendation with Device-cloud Collaboration

Liu, Tianqi, Fu, Kairui, Zhang, Shengyu, Fan, Wenyan, Du, Zhaocheng, Zhu, Jieming, Wu, Fan, Wu, Fei

arXiv.org Artificial Intelligence, Oct-6-2025

With the advancement of mobile device capabilities, deploying reranking models directly on devices has become feasible, enabling real-time contextual recommendations. When migrating models from cloud to devices, resource heterogeneity inevitably necessitates model compression. Recent quantization methods show promise for efficient deployment, yet they overlook device-specific user interests, resulting in compromised recommendation accuracy. While on-device finetuning captures personalized user preference, it imposes additional computational burden through local retraining. To address these challenges, we propose a framework for Customizing Hybrid-precision On-device model for sequential Recommendation with Device-cloud collaboration (CHORD), leveraging channel-wise mixed-precision quantization to simultaneously achieve personalization and resource-adaptive deployment. CHORD distributes randomly initialized models across heterogeneous devices and identifies user-specific critical parameters through auxiliary hypernetwork modules on the cloud. Our parameter sensitivity analysis operates across multiple granularities (layer, filter, and element levels), enabling precise mapping from user profiles to quantization strategy. Through on-device mixed-precision quantization, CHORD delivers dynamic model adaptation and accelerated inference without backpropagation, eliminating costly retraining cycles. We minimize communication overhead by encoding quantization strategies using only 2 bits per channel instead of 32-bit weights. Experiments on three real-world datasets with two popular backbones (SASRec and Caser) demonstrate the accuracy, efficiency, and adaptivity of CHORD.
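The "2 bits per channel" encoding mentioned in the abstract above can be pictured as packing one small code per channel into a byte string. The sketch below is a guess at what such a payload could look like, not CHORD's wire format: the code-to-bit-width table `CODE_TO_BITS` and both function names are hypothetical.

```python
# Hypothetical mapping from a 2-bit code to a per-channel bit-width.
CODE_TO_BITS = (2, 4, 8, 16)


def pack_strategy(codes):
    """Pack per-channel codes (each 0..3) into bytes, four channels per byte."""
    packed = bytearray((len(codes) + 3) // 4)
    for i, code in enumerate(codes):
        if not 0 <= code <= 3:
            raise ValueError(f"code {code} does not fit in 2 bits")
        packed[i // 4] |= code << (2 * (i % 4))
    return bytes(packed)


def unpack_strategy(packed, num_channels):
    """Recover the per-channel codes from a packed payload."""
    return [(packed[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(num_channels)]
```

With this layout a strategy for `n` channels takes `ceil(n / 4)` bytes, whereas sending a single 32-bit value per channel would take `4 * n` bytes, and sending the weights themselves would take far more.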

artificial intelligence, machine learning, quantization, (17 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3746027.3755632

2510.03038

Country: Asia > China > Zhejiang Province (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

QuantX: A Framework for Hardware-Aware Quantization of Generative AI Workloads

Ahmad, Muhammad, Mazher, Khurram, Akram, Saqib, Tameem, Ahmad, Nasir, Saad Bin

arXiv.org Artificial Intelligence, Sep-15-2025

We present QuantX: a tailored suite of recipes for LLM and VLM quantization. It is capable of quantizing down to 3-bit resolutions with minimal loss in performance. The quantization strategies in QuantX take into account hardware-specific constraints to achieve efficient dequantization during inference, ensuring a flexible trade-off between runtime speed, memory requirements, and model accuracy. Our results demonstrate that QuantX achieves performance within 6% of the unquantized model for LLaVA-v1.6 quantized down to 3 bits for multiple end-user tasks and outperforms recently published state-of-the-art quantization techniques. We further integrate one particular technique from QuantX into the popular llama.cpp framework and show its feasibility in terms of runtime compared to the mainstream quantization techniques from llama.cpp. Lastly, this manuscript provides insights into the LLM quantization process that motivated the range of recipes and options that are incorporated in QuantX.

large language model, machine learning, quantization, (18 more...)

arXiv.org Artificial Intelligence

2505.07531

Genre: Research Report > New Finding (0.54)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.75)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.40)

SegQuant: A Semantics-Aware and Generalizable Quantization Framework for Diffusion Models

Zhang, Jiaji, Sun, Ruichao, Zhao, Hailiang, Wu, Jiaju, Chen, Peng, Li, Hao, Liu, Yuying, Chow, Kingsum, Xiong, Gang, Deng, Shuiguang

arXiv.org Artificial Intelligence, Aug-28-2025

Diffusion models have demonstrated exceptional generative capabilities but are computationally intensive, posing significant challenges for deployment in resource-constrained or latency-sensitive environments. Quantization offers an effective means to reduce model size and computational cost, with post-training quantization (PTQ) being particularly appealing due to its compatibility with pre-trained models without requiring retraining or training data. However, existing PTQ methods for diffusion models often rely on architecture-specific heuristics that limit their generalizability and hinder integration with industrial deployment pipelines. To address these limitations, we propose SegQuant, a unified quantization framework that adaptively combines complementary techniques to enhance cross-model versatility. SegQuant consists of a segment-aware, graph-based quantization strategy (SegLinear) that captures structural semantics and spatial heterogeneity, along with a dual-scale quantization scheme (DualScale) that preserves polarity-asymmetric activations, which is crucial for maintaining visual fidelity in generated outputs. SegQuant is broadly applicable beyond Transformer-based diffusion models, achieving strong performance while ensuring seamless compatibility with mainstream deployment tools.
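To see why a dual-scale scheme helps with polarity-asymmetric activations, consider a tensor whose negative values are much smaller in magnitude than its positive ones. The sketch below gives each sign its own scale and compares the result with a single symmetric scale. It is a minimal illustration under assumed 8-bit symmetric rounding, not SegQuant's DualScale implementation, and the function names are hypothetical.

```python
def quantize_dual_scale(x, bits=8):
    """Quantize positives and negatives with separate scales. Returns (codes, pos_scale, neg_scale)."""
    qmax = 2 ** (bits - 1) - 1
    pos_max = max((v for v in x if v > 0), default=0.0)
    neg_max = max((-v for v in x if v < 0), default=0.0)
    pos_scale = pos_max / qmax if pos_max > 0 else 1.0
    neg_scale = neg_max / qmax if neg_max > 0 else 1.0
    codes = [round(v / (pos_scale if v >= 0 else neg_scale)) for v in x]
    return codes, pos_scale, neg_scale


def dequantize_dual_scale(codes, pos_scale, neg_scale):
    """The sign of each integer code selects which scale to apply."""
    return [c * (pos_scale if c >= 0 else neg_scale) for c in codes]


def quantize_single_scale(x, bits=8):
    """Baseline: symmetric quantize-dequantize with one scale for the whole tensor."""
    qmax = 2 ** (bits - 1) - 1
    amax = max((abs(v) for v in x), default=0.0)
    scale = amax / qmax if amax > 0 else 1.0
    return [round(v / scale) * scale for v in x]
```

For an input such as `[-0.25, -0.1, 0.0, 4.0, 10.0]`, the single-scale baseline spends almost its entire negative code range on values that never occur, while the dual-scale variant resolves the small negative values much more finely.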

artificial intelligence, machine learning, quantization, (15 more...)

arXiv.org Artificial Intelligence

2507.14811

Country:

Europe > Italy (0.28)
Asia (0.28)

Genre: Research Report (0.82)

Industry:

Leisure & Entertainment (1.00)
Media > Photography (0.68)
Consumer Products & Services (0.67)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

FedHQ: Hybrid Runtime Quantization for Federated Learning

Zheng, Zihao, Wang, Ziyao, Cui, Xiuping, Li, Maoliang, Chen, Jiayu, Liang, Yun, Li, Ang, Chen, Xiang

arXiv.org Artificial Intelligence, May-20-2025

Federated Learning (FL) is a decentralized model training approach that preserves data privacy but struggles with low efficiency. Quantization, a powerful training optimization technique, has been widely explored for integration into FL. However, many studies fail to consider the distinct performance characteristics of particular quantization strategies, such as post-training quantization (PTQ) or quantization-aware training (QAT). As a result, existing FL quantization methods rely solely on either PTQ or QAT, optimizing for speed or accuracy while compromising the other. To efficiently accelerate FL and maintain distributed convergence accuracy across various FL settings, this paper proposes a hybrid quantization approach combining PTQ and QAT for FL systems. We conduct case studies to validate the effectiveness of using hybrid quantization in FL. To address the difficulty of modeling speed and accuracy caused by device and data heterogeneity, we propose a hardware-related analysis and a data-distribution-related analysis to help identify the trade-off boundaries for strategy selection. Based on these, we propose a novel framework named FedHQ that automatically adopts the optimal hybrid strategy allocation for FL systems. Specifically, FedHQ develops a coarse-grained global initialization and fine-grained ML-based adjustment to ensure efficiency and robustness. Experiments show that FedHQ achieves up to 2.47x training acceleration and up to 11.15% accuracy improvement with negligible extra overhead.

artificial intelligence, machine learning, quantization strategy, (13 more...)

arXiv.org Artificial Intelligence

2505.11982

Country:

North America > United States > Maryland (0.04)
North America > Canada (0.04)

Genre: Research Report (0.64)

Industry: Information Technology > Security & Privacy (0.69)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

A probabilistic framework for dynamic quantization

Santini, Gabriele, Paissan, Francesco, Farella, Elisabetta

arXiv.org Artificial Intelligence, May-19-2025

We propose a probabilistic framework for dynamic quantization of neural networks that allows for a computationally efficient input-adaptive rescaling of the quantization parameters. Our framework applies a probabilistic model to the network's pre-activations through a lightweight surrogate, enabling the adaptive adjustment of the quantization parameters on a per-input basis without significant memory overhead. We validate our approach on a set of popular computer vision tasks and models, observing only a negligible loss in performance. Our method strikes the best performance and computational overhead tradeoff compared to standard quantization strategies.
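One way to picture input-adaptive rescaling through a lightweight surrogate is to predict the range of a layer's pre-activations from cheap input statistics instead of computing a max over the actual outputs. The sketch below assumes a linear layer whose input elements are treated as independent with a shared mean and variance. That modeling assumption, the `k`-sigma range rule, and the function names are illustrative choices and are not taken from the paper.

```python
import math


def predict_range(weight_rows, x, k=3.0):
    """Predict [lo, hi] for pre-activations y = W x from the mean/variance of x.

    Under the assumed i.i.d. input model, output j has mean mu * sum(W_j) and
    variance var * sum(W_j ** 2); the row sums can be precomputed offline.
    """
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n
    lo, hi = math.inf, -math.inf
    for row in weight_rows:
        mean_j = mu * sum(row)
        std_j = math.sqrt(var * sum(w * w for w in row))
        lo = min(lo, mean_j - k * std_j)
        hi = max(hi, mean_j + k * std_j)
    return lo, hi


def affine_params(lo, hi, bits=8):
    """Turn a predicted range into an unsigned affine scale and zero point."""
    qmax = 2 ** bits - 1
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero_point = round(-lo / scale)
    return scale, zero_point
```

If the per-row sums of `W` and `W ** 2` are cached, the per-input cost reduces to computing the mean and variance of `x`, which is the kind of low overhead a dynamic scheme needs. Outputs that fall outside the predicted range would be clipped.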

artificial intelligence, machine learning, quantization, (20 more...)

arXiv.org Artificial Intelligence

2505.10689

Country: Europe > Italy > Trentino-Alto Adige/Südtirol > Trentino Province > Trento (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

MoQa: Rethinking MoE Quantization with Multi-stage Data-model Distribution Awareness

Zheng, Zihao, Cui, Xiuping, Zheng, Size, Li, Maoliang, Chen, Jiayu, Liang, Yun, Chen, Xiang

arXiv.org Artificial Intelligence, Mar-26-2025

However, the parameter density of LLMs has struggled to keep pace with the diverse and increasing volumes of data to be processed. To address this limitation, the Mixture-of-Experts (MoE) architecture has emerged as one of the most promising LLM implementation approaches [1]. An MoE model contains multiple "expert" networks, which consist of individual models or specialized layers, and each expert is trained to fit a different aspect of the data. When deployed in a particular inference scenario, the MoE dynamically selects a subset of these experts to be sparsely activated, allowing the MoE to synthesize the corresponding data distribution [2-4]. Although MoE models demonstrate improved parameter scalability and memory efficiency with sparse activation, they still face the need for parameter compression [5, 6]. As revealed by a large number of LLM compression studies, quantization has proven to be the most efficient compression method, reducing model volume by converting parameters into low-precision numbers [7]. With the development of quantization techniques, the focus of methodology has gradually shifted from the parameters themselves to the mapping relationship between the parameters and the complex data inputs. Some methods, such as GPTQ [8], leverage data distribution analysis to establish a data-parameter mapping that guides iterative channel-wise parameter quantization; later methods further examine the relative data scale and its impact on data-parameter correlation, and highlight the significant variation of parameters (e.g., SmoothQuant [9], AWQ [10]), thus achieving mixed-precision quantization with better performance (e.g., Atom [11]).

expert model, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2503.21135

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > California > Santa Clara County > Stanford (0.04)
Europe > France (0.04)

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Universality of Layer-Level Entropy-Weighted Quantization Beyond Model Architecture and Size

Behtash, Alireza, Fofonjka, Marijan, Baird, Ethan, Mauer, Tyler, Moghimifam, Hossein, Stout, David, Dennison, Joel

arXiv.org Artificial Intelligence, Mar-7-2025

We present a novel approach to selective model quantization that transcends the limitations of architecture-specific and size-dependent compression methods for Large Language Models (LLMs) using Entropy-Weighted Quantization (EWQ). By analyzing the entropy distribution across transformer blocks, EWQ determines which blocks can be safely quantized without causing significant performance degradation, independent of model architecture or size. Our method outperforms uniform quantization approaches, maintaining Massive Multitask Language Understanding (MMLU) accuracy scores within 0.5% of unquantized models while reducing memory usage by up to 18%. We demonstrate the effectiveness of EWQ across multiple architectures -- from 1.6B to 70B parameters -- and showcase consistent improvements in the quality-compression trade-off regardless of model scale or architectural design. A surprising finding of EWQ is its ability to reduce perplexity compared to unquantized models, suggesting the presence of beneficial regularization through selective precision reduction. This improvement holds across different model families, indicating a fundamental relationship between layer-level entropy and optimal precision requirements. Additionally, we introduce FastEWQ, a rapid method for entropy distribution analysis that eliminates the need for loading model weights. This technique leverages universal characteristics of entropy distribution that persist across various architectures and scales, enabling near-instantaneous quantization decisions while maintaining 80% classification accuracy with full entropy analysis. Our results demonstrate that effective quantization strategies can be developed independently of specific architectural choices or model sizes, opening new possibilities for efficient LLM deployment.
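The block-selection step described in the abstract above can be sketched as computing an entropy score per block and then thresholding. The example below is an assumption-laden illustration rather than the EWQ algorithm: it estimates entropy from a fixed-bin histogram of each block's weights and marks blocks at or below the mean entropy as candidates for quantization. The estimator, the threshold rule, and the function names are hypothetical.

```python
import math


def histogram_entropy(weights, num_bins=16):
    """Shannon entropy (in bits) of a fixed-bin histogram of the weights."""
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return 0.0
    counts = [0] * num_bins
    for w in weights:
        idx = min(int((w - lo) / (hi - lo) * num_bins), num_bins - 1)
        counts[idx] += 1
    n = len(weights)
    return -sum(c / n * math.log2(c / n) for c in counts if c)


def select_blocks(blocks, num_bins=16):
    """Return (names of blocks chosen for quantization, entropy per block).

    Assumed rule: blocks whose entropy is at or below the mean are quantized.
    """
    entropies = {name: histogram_entropy(w, num_bins) for name, w in blocks.items()}
    threshold = sum(entropies.values()) / len(entropies)
    chosen = sorted(name for name, h in entropies.items() if h <= threshold)
    return chosen, entropies
```

In this toy rule a block whose weights are concentrated in a few bins has low entropy and is treated as safe to quantize, while a block with evenly spread weights is left at higher precision. A real selection would also need to account for a memory budget.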

large language model, machine learning, quantization, (18 more...)

arXiv.org Artificial Intelligence

2503.04704

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
